Existing 3D-aware image synthesis approaches mainly focus on generating a single canonical object and show limited capacity in composing a complex scene containing a variety of objects. This work presents DisCoScene: a 3D-aware generative model for high-quality and controllable scene synthesis. The key ingredient of our method is a very abstract object-level representation (i.e., 3D bounding boxes without semantic annotation) as the scene layout prior, which is simple to obtain, general enough to describe various scene contents, and yet informative enough to disentangle objects and background. Moreover, it serves as an intuitive user control for scene editing. Based on such a prior, the proposed model spatially disentangles the whole scene into object-centric generative radiance fields by learning on only 2D images with global-local discrimination. Our model obtains the generation fidelity and editing flexibility of individual objects while being able to efficiently compose objects and the background into a complete scene. We demonstrate state-of-the-art performance on many scene datasets, including the challenging Waymo outdoor dataset. Project page: https://snap-research.github.io/discoscene/
As a powerful representation of 3D scenes, the neural radiance field (NeRF) enables high-quality novel view synthesis from multi-view images. Stylizing NeRF, however, remains challenging, especially when simulating a text-guided style in which both the appearance and the geometry are altered simultaneously. In this paper, we present NeRF-Art, a text-guided NeRF stylization approach that manipulates the style of a pre-trained NeRF model with a simple text prompt. Unlike previous approaches that either lack sufficient geometry deformations and texture details or require meshes to guide the stylization, our method can shift a 3D scene to the target style characterized by desired geometry and appearance variations without any mesh guidance. This is achieved by introducing a novel global-local contrastive learning strategy, combined with a directional constraint to simultaneously control both the trajectory and the strength of the target style. Moreover, we adopt a weight regularization method to effectively suppress the cloudy artifacts and geometric noise that arise easily when the density field is transformed during geometry stylization. Through extensive experiments on various styles, we demonstrate that our method is effective and robust regarding both single-view stylization quality and cross-view consistency. The code and more results can be found on our project page: https://cassiepython.github.io/nerfart/.
Learning physical systems on unstructured meshes with flat graph neural networks (GNNs) faces the challenge of modeling long-range interactions, due to the scaling complexity w.r.t. the number of nodes, which limits generalization under mesh refinement. On regular grids, convolutional neural networks (CNNs) with a U-net structure can resolve this challenge through efficient stride, pooling, and upsampling operations. Nonetheless, these tools are much less developed for GNNs, especially when GNNs are employed for learning large-scale mesh-based physics. The challenges arise from the highly irregular meshes and the lack of effective ways to construct the multi-level structure without losing connectivity. Inspired by the bipartite graph determination algorithm, we introduce the Bi-Stride Multi-Scale Graph Neural Network (BSMS-GNN) by proposing bi-stride as a simple pooling strategy for building the multi-level GNN. Bi-stride pools nodes by striding every other BFS frontier; it 1) works robustly on any challenging mesh in the wild, 2) avoids using a mesh generator at coarser levels, 3) avoids relying on spatial proximity for building coarser levels, and 4) uses non-parametrized aggregating/returning instead of MLPs during pooling and unpooling. Experiments show that our framework significantly outperforms the state-of-the-art method in computational efficiency in representative physics-based simulation cases.
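The bi-stride idea of "striding every other BFS frontier" can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `bfs_depths` and `bi_stride_pool`, the adjacency-dictionary input, the choice of seed node, and the rule used here to reconnect the kept nodes (linking kept nodes that are at most two hops apart on the fine mesh) are all assumptions made for illustration.

```python
from collections import deque


def bfs_depths(adj, seed):
    """Return the BFS depth of every node reachable from `seed`."""
    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in depth:
                depth[v] = depth[u] + 1
                queue.append(v)
    return depth


def bi_stride_pool(adj, seed=0):
    """Keep the nodes on every other BFS frontier (even depths).

    Assumption: the graph is connected; nodes unreachable from `seed`
    are simply dropped in this sketch.
    """
    depth = bfs_depths(adj, seed)
    kept = sorted(n for n, d in depth.items() if d % 2 == 0)
    kept_set = set(kept)

    # Assumed reconnection rule: two kept nodes share a coarse edge if
    # they are adjacent or share a neighbour on the fine mesh.
    coarse = {n: set() for n in kept}
    for u in kept:
        for mid in adj[u]:
            if mid in kept_set:
                coarse[u].add(mid)
            for v in adj[mid]:
                if v != u and v in kept_set:
                    coarse[u].add(v)
    return kept, coarse


# Usage on a 5-node path graph 0-1-2-3-4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
kept_nodes, coarse_adj = bi_stride_pool(path, seed=0)
```

On the path graph, the kept nodes are the ones at even BFS depth, and the two-hop rule keeps the coarse level connected without a mesh generator or any spatial-proximity search.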
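Creating and editing the shape and color of 3D objects requires tremendous human effort and expertise. Compared to direct manipulation in 3D interfaces, 2D interactions such as sketches and scribbles are usually much more natural and intuitive for users. In this paper, we propose a generic multi-modal generative model that couples 2D modalities and implicit 3D representations through shared latent spaces. With the proposed model, versatile 3D generation and manipulation are enabled by simply propagating the edits from a specific 2D controlling modality through the latent spaces. Examples include editing a 3D shape by drawing a sketch, re-colorizing the 3D surface by painting color scribbles on the 2D rendering, or generating 3D shapes of a certain category given one or a few reference images. Unlike prior works, our model does not require re-training or fine-tuning per editing task; it is also conceptually simple, easy to implement, robust to input domain shifts, and flexible enough to produce diverse reconstructions from partial 2D inputs. We evaluate our framework on two representative 2D modalities, grayscale line sketches and rendered color images, and demonstrate that our method enables various shape manipulation and generation tasks with these 2D modalities.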
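We present Dance2Music-GAN (D2M-GAN), a novel adversarial multi-modal framework that generates complex musical samples conditioned on dance videos. Our proposed framework takes dance video frames and human body motions as input, and learns to generate music samples that plausibly accompany the corresponding input. Most existing conditional music generation works generate specific types of mono-instrumental sounds using symbolic audio representations (e.g., MIDI) and usually rely on pre-defined musical synthesizers. In this work, by contrast, we generate dance music in complex styles (e.g., pop, breaking, etc.) by employing a Vector Quantized (VQ) audio representation, leveraging both its generality and the high abstraction capacity of its symbolic and continuous counterparts. By performing an extensive set of experiments on multiple datasets, and following a comprehensive evaluation protocol, we assess the generative qualities of our proposal against alternatives. The attained quantitative results, which measure music consistency, beat correspondence, and music diversity, demonstrate the effectiveness of our proposed method. Last but not least, we curate a challenging dance-music dataset of in-the-wild TikTok videos, which we use to further demonstrate the efficacy of our approach in real-world applications, and which we hope will serve as a starting point for relevant future research.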
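The recent explosion of research on neural radiance fields (NeRF) shows the encouraging potential of representing complex scenes with neural networks. One major drawback of NeRF is its inference time: rendering a single pixel requires querying the NeRF network hundreds of times. To resolve this, existing efforts mainly attempt to reduce the number of required sampling points. However, the problem of iterative sampling still exists. On the other hand, the neural light field (NeLF) presents a more straightforward representation than NeRF for novel view synthesis: rendering a pixel amounts to one single forward pass, without ray-marching. In this work, we present a deep residual MLP network (88 layers) to effectively learn the light field. We show that the key to successfully learning such a deep NeLF network is to have sufficient data, for which we transfer knowledge from a pre-trained NeRF model via data distillation. Extensive experiments on both synthetic and real-world scenes show the merits of our method over other counterpart algorithms. On the synthetic scenes, we achieve a 26-35x reduction in FLOPs (per camera ray) and a 28-31x runtime speedup, while delivering better rendering quality than NeRF (1.4-2.8 dB average PSNR improvement), without any customized parallelism requirement.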
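The cost difference described above can be made concrete with a small sketch that counts network queries per pixel. It is an illustration only: `CountingField`, `render_nerf_pixel`, `render_nelf_pixel`, the uniform sampling scheme, and the toy constant fields are hypothetical stand-ins, not the paper's networks. The compositing in `composite_ray` follows the standard NeRF volume-rendering quadrature.

```python
import math


def composite_ray(sigmas, colors, deltas):
    """Alpha-composite density/color samples along one ray (NeRF quadrature)."""
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for sigma, color, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    return pixel


class CountingField:
    """Hypothetical stand-in for a trained network; counts its queries."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, *args):
        self.calls += 1
        return self.fn(*args)


def render_nerf_pixel(field, origin, direction, n_samples=128, near=0.0, far=4.0):
    """NeRF-style pixel: one query per sample point, then compositing."""
    delta = (far - near) / n_samples
    sigmas, colors = [], []
    for i in range(n_samples):
        t = near + (i + 0.5) * delta  # assumed uniform mid-point sampling
        point = [o + t * d for o, d in zip(origin, direction)]
        sigma, color = field(point, direction)
        sigmas.append(sigma)
        colors.append(color)
    return composite_ray(sigmas, colors, [delta] * n_samples)


def render_nelf_pixel(light_field, origin, direction):
    """NeLF-style pixel: the ray maps to a color in a single query."""
    return light_field(origin, direction)


# Toy fields (assumptions): constant red fog, and a constant red light field.
nerf = CountingField(lambda point, direction: (0.5, [1.0, 0.0, 0.0]))
nelf = CountingField(lambda origin, direction: [1.0, 0.0, 0.0])

nerf_pixel = render_nerf_pixel(nerf, [0.0, 0.0, 0.0], [0.0, 0.0, 1.0])
nelf_pixel = render_nelf_pixel(nelf, [0.0, 0.0, 0.0], [0.0, 0.0, 1.0])
```

After running the sketch, `nerf.calls` equals the number of samples per ray while `nelf.calls` is one, which is the per-pixel gap that a light-field network removes.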
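We present a novel method to acquire object representations from online image collections, capturing high-quality geometry and material properties of arbitrary objects from photographs with varying cameras, illumination, and backgrounds. This enables various object-centric rendering applications, such as novel-view synthesis, relighting, and harmonized background composition, from challenging in-the-wild input. Using a multi-stage approach that extends neural radiance fields, we first infer the surface geometry and refine the coarsely estimated initial camera parameters, while leveraging coarse foreground object masks to improve training efficiency and geometry quality. We also introduce a robust normal estimation technique that eliminates the effect of geometric noise while retaining crucial details. Lastly, we extract surface material properties and ambient illumination, represented in spherical harmonics, with extensions that handle transient elements such as sharp shadows. The union of these components results in a highly modular and efficient object acquisition framework. Extensive evaluations and comparisons demonstrate the advantages of our approach in capturing high-quality geometry and appearance properties useful for rendering applications.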
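We present CLIP-NeRF, a multi-modal 3D object manipulation method for neural radiance fields (NeRF). By leveraging the joint language-image embedding space of the recent Contrastive Language-Image Pre-Training (CLIP) model, we propose a unified framework that allows manipulating NeRF in a user-friendly way, using either a short text prompt or an exemplar image. Specifically, to combine the novel view synthesis capability of NeRF with the controllable manipulation ability of latent representations from generative models, we introduce a disentangled conditional NeRF architecture that allows individual control over both shape and appearance. This is achieved by performing shape conditioning via a learned deformation field applied to the positional encoding, and by deferring color conditioning to the volumetric rendering stage. To bridge this disentangled latent representation to the CLIP embedding, we design two code mappers that take a CLIP embedding as input and update the latent codes to reflect the targeted edit. The mappers are trained with a CLIP-based matching loss to ensure manipulation accuracy. Furthermore, we propose an inverse optimization method that accurately projects an input image to the latent codes for manipulation, enabling editing on real images. We evaluate our approach through extensive experiments on a variety of text prompts and exemplar images, and also provide an intuitive interface for interactive editing. Our implementation is available at https://cassiepython.github.io/clipnerf/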
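3D-guided facial image manipulation has been widely applied in various interactive scenarios due to its semantic understanding and user-friendly controllability. However, existing manipulation methods based on 3D morphable models (3DMM) are not directly applicable to out-of-domain faces, such as non-photorealistic paintings, cartoon portraits, or even animals, mainly because of the formidable difficulty of building a model for each specific face domain. To overcome this challenge, we propose what is, to the best of our knowledge, the first method to manipulate faces in arbitrary domains using a human 3DMM. This is achieved through two major steps: 1) a disentangled mapping from 3DMM parameters to the latent space embedding of a pre-trained StyleGAN2, which guarantees disentangled and precise control over each semantic attribute; and 2) cross-domain adaptation, which bridges domain discrepancies and makes the human 3DMM applicable to out-of-domain faces by enforcing a consistent latent space embedding. Experiments and comparisons demonstrate the superiority of our high-quality semantic manipulation method on a variety of face domains, with all major 3D facial attributes controllable: pose, expression, shape, albedo, and illumination. Moreover, we develop an intuitive editing interface to support user-friendly control and instant feedback. Our project page is https://cassiepython.github.io/cddfm3d/index.html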
Modern autonomous driving systems are characterized as modular tasks in sequential order, i.e., perception, prediction, and planning. As sensors and hardware improve, there is a growing trend toward devising a system that can perform a wide diversity of tasks to achieve higher-level intelligence. Contemporary approaches resort either to deploying standalone models for individual tasks or to designing a multi-task paradigm with separate heads. These might suffer from accumulated errors or negative transfer effects. Instead, we argue that a favorable algorithm framework should be devised and optimized in pursuit of the ultimate goal, i.e., planning for the self-driving car. Oriented toward this goal, we revisit the key components within perception and prediction. We analyze each module and prioritize the tasks hierarchically, such that all of them contribute to planning (the goal). To this end, we introduce Unified Autonomous Driving (UniAD), the first comprehensive framework to date that incorporates full-stack driving tasks in one network. It is carefully devised to leverage the advantages of each module and to provide complementary feature abstractions for agent interaction from a global perspective. Tasks communicate through a unified query design so that they facilitate each other toward planning. We instantiate UniAD on the challenging nuScenes benchmark. Extensive ablations show that this philosophy is effective, surpassing previous state-of-the-art methods by a large margin in all aspects. The full codebase and models will be made available to facilitate future research in the community.